Recent research has focused on mining opinions from public sentiment using natural language
processing (NLP) and machine learning (ML) techniques. Transformer-based models, such as
bidirectional encoder representations from transformers (BERT), excel at extracting semantic
information but are resource-intensive. Google’s recent work, mixing tokens with Fourier transforms,
also known as FNet, replaced BERT’s attention mechanism with a non-parameterized Fourier transform,
aiming to reduce training time without compromising performance. This study fine-tuned the FNet
model on a publicly available Kaggle hotel review dataset and compared its performance on this
dataset with that of BERT and of conventional models such as long short-term memory (LSTM) and
support vector machine (SVM). Results revealed that FNet reduces training time by almost 20% and
memory utilization by nearly 60% compared to BERT. With parameters identical to BERT’s, FNet reached
a test accuracy of 80.27%, which is about 97.85% of BERT’s accuracy.


Keywords:

Bidirectional encoder representations from transformers
FNet
Fourier transform
Hotel reviews
Sentiment analysis
Transformer-based models


This is an open access article under the CC BY-SA license. 


Corresponding Author: 


Shovan Bhowmik 

Department of Computer and Information Sciences, University of Delaware 
Newark, USA 

Email: bhowmik.sshovon5795@gmail.com


1. INTRODUCTION 

Blogs, forums, and online social networks such as Facebook and Twitter have expanded with the
advancement of technology, giving users opportunities to discuss any subject and express their ideas.
Users might share opinions on current events, debate issues, or comment on their satisfaction with a
product, for instance. This is why sentiment analysis, also known as opinion mining, has become an
important topic in machine learning research. Despite the significance of sentiment analysis and its
range of existing applications, many challenges in natural language processing (NLP) remain unresolved.

Over time, many machine learning (ML) and deep learning (DL) models have been used to improve
the robustness of sentiment analysis tasks, including long short-term memory (LSTM), convolutional neural
network (CNN), and support vector machine (SVM) [1], [2]. Even though these conventional models have
shown significant improvements in sentiment analysis and in handling long texts and sequences, extracting
context-associated information remains a major challenge. To overcome this constraint, researchers
introduced bidirectional encoder representations from transformers (BERT) [3], which has been
successfully integrated into many NLP research projects with remarkable performance improvements.
BERT is a transformer-based pre-trained model that uses the self-attention mechanism and encoding
feature of transformers to perform several tasks, e.g., text classification, question answering, and
relation extraction. Moreover, the self-attention mechanism supports a context-based approach when
working with text data and reduces the number of data preprocessing steps. However, this approach
greatly increases training time and memory consumption [4]. Several previous studies have indicated that
Fourier transforms can be integrated to speed up the computation of different DL models such as
convolutional neural networks [5], recurrent neural networks [6], and transformers [7]. Accordingly,
Google researchers recently proposed a new model named FNet: mixing tokens with Fourier transforms
[8], which replaced the self-attention mechanism with Fourier transforms. According to them, FNet achieves
nearly the state-of-the-art performance of BERT with much faster training and lower memory utilization.
Although the researchers demonstrated the effectiveness of this model by comparing the performance of
FNet with other transformer-based models, it was benchmarked only for the pretraining part. In this paper, we
fine-tuned this model for sentiment analysis to validate their work on a sufficiently large-scale
hotel review dataset. Our contributions are summarized as follows:
— Applying the FNet model to analyze sentiments from hotel reviews. 
— Presenting detailed comparison between FNet and BERT models regarding execution time and memory 
usage. 
— Performing a comparative study between FNet, BERT, LSTM, and SVM models. 
— Conducting an ablation study with different learning rates to identify the best parameters for our FNet 
model for the hotel review dataset. 


2. RELATED WORKS 

Machine learning and deep learning are widely used in sentiment analysis tasks. To recognize
sentiments from COVID-19 tweet data, Jain et al. [9] used different machine and deep learning algorithms.
That study also proposed an ensemble-based approach, which outperformed the other approaches. In another
work, Gandhi et al. used the deep learning algorithms CNN and LSTM to analyze sentiments from Twitter data.
The preprocessing step of that experiment included Word2Vec embedding for mapping words to vectors. In
another study, Rahman et al. performed sentiment analysis on customer feedback from six mobile banking
applications using the SVM and Naive Bayes (NB) algorithms. In their experiment, linear SVM generated
the best result with 97.17% accuracy. We also studied research on sentiment analysis in other languages,
such as Arabic [12]. That work developed an ensemble-based classifier to detect emotions in an Arabic
context. The ensemble-based model outperformed traditional ML algorithms such as SVM, AdaBoost classifier,
maximum entropy, k-nearest neighbors, decision tree, random forest, logistic regression, and Naive Bayes.
In addition, Chiny et al. showed that the dimension of the word mapping vector affects the sentiment
analysis result on short and long texts. Their experiment showed that increasing the dimension of the
word vector for long text resulted in better performance. Some works used ML and DL models on hotel
reviews [14]-[16] and social media posts [17]. These works include models such as LSTM, decision tree,
SVM, and Bayesian classifiers.

Recently, transformer-based models have become very popular for NLP tasks, e.g., sentiment
analysis, text or review classification, and document categorization. A transformer-based BERT model has
been used to identify harmful news [19], [20]. Additionally, models such as BERT and ELECTRA are being
used to identify abusive comments or texts and hate speech on social media platforms such as Facebook and
Twitter [21], [22]. The applications of transformer-based algorithms are also expanding in document
classification and analysis [23], [24].

Some research works have used Fourier analysis with convolutional neural networks (CNN) and
recurrent neural networks (RNN) [6], [25]. The motivation for these works was fast computation and
approximation. The Fourier transform has also been used in transformer-based models [26]-[28]. All these
studies tried to scale or improve the attention mechanism of transformers by introducing a new
approximation method or filter, but none of them replaced the entire attention layer with an alternative.
The attention mechanism was first replaced with the Fourier transform by Lee-Thorp et al. [8].


3. METHOD 
In this section, we will discuss the overall steps required in our study to predict sentiments from text 
data using FNet. This will be followed by a description of transformer architecture (BERT) and FNet. 

3.1. System architecture

Figure 1 shows the overall workflow of our project. The dataset consisted of raw text from customer
reviews about hotels, each with two parts, positive and negative feedback, along with a review score. As
the texts had no explicit labels, we preprocessed the data both to clean it and to assign labels based on
the review score. If the review score for a particular text was greater than 5, we considered it a
positive sentiment; otherwise, it was negative. Then, we split the data into training, validation, and
test portions. For both FNet and BERT, we loaded the pre-trained weights of the original transformer-based
models and fine-tuned them on our dataset. To do so, we tokenized the text of the dataset and prepared the
input representation required to feed the data into the FNet architecture. We created tensors of input
IDs, input type IDs, and position embeddings for each text and then passed these inputs to FNet. Note that
we created attention masks for BERT but removed them when training FNet, since token mixing in FNet is
performed by the Fourier transform inside the model. For class prediction and hyperparameter selection, we
added two classifier layers on top of FNet with a sigmoid activation function. The classification layers
had shapes (768, 256) and (256, 2). The input size of the first layer was 768 because FNet produces a
vector of size 768 for classification from the input tokens through the computations inside its hidden
encoder layers. The output size of 2 in the last layer corresponded to the number of classes. Once
prediction was complete, we compared our model with the existing techniques. For the SVM, we generated
term frequency-inverse document frequency (TF-IDF) features [29]. As TF-IDF is not a contextual feature,
we preprocessed the data before fitting. The steps were punctuation removal, word tokenization, stopword
removal, stemming, lower-case conversion, and detokenization.
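
The labelling rule and the SVM preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the stopword list and the crude suffix-stripping stemmer are placeholders chosen for this example, and lower-casing is applied first so that stopword matching works.

```python
import string

# Illustrative stopword list (an assumption; the study's actual list is not given).
STOPWORDS = {"the", "a", "an", "and", "was", "is", "are", "but", "for", "to", "of"}


def label_from_score(score: float, threshold: float = 5.0) -> int:
    """Return 1 (positive) if the review score exceeds the threshold, else 0."""
    return 1 if score > threshold else 0


def crude_stem(word: str) -> str:
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess_for_tfidf(text: str) -> str:
    """Lower-case, strip punctuation, tokenize, drop stopwords, stem, rejoin."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization
    tokens = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)  # detokenization
```

The cleaned strings would then be passed to a TF-IDF vectorizer before fitting the SVM; that step is omitted here to keep the sketch free of external dependencies.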


[Figure 1 diagram omitted. Workflow: text data (positive part + negative part, labelling the text) ->
pre-process -> dataset -> word embedding (token, position, input type) -> FNet -> classifier (FFNN +
sigmoid) -> model comparison with BERT (original), LSTM, and SVM (TF-IDF features).]

Figure 1. Proposed system architecture for classifying hotel reviews


3.2. Model description 
In the following parts, we will describe the internal architecture of our selected FNet model along with 
its similarity and dissimilarity with the transformer-based BERT architecture. 


3.2.1. Transformer-based models

Transformer-based models are revolutionizing NLP tasks. They are designed especially for data with
sequential inputs. The attention mechanism is considered the heart of the transformer architecture. It
is a multi-head self-attention mechanism that relates different positions of the input sequence. The
architecture has an encoder part and a decoder part. The encoder maps the input sequence to a continuous
representation, and the decoder transforms that representation into an output sequence.

BERT is a popular transformer-based model with a bidirectional transformer encoder [3]. Figure 2
depicts the BERT architecture with the required layers. Token sequences generated from single or multiple
sentences can be the input to the BERT model. A token sequence includes three special BERT tokens, namely
[CLS], [SEP], and [PAD], representing classification, separator, and padding tokens. Masked language
modeling (MLM) and next sentence prediction (NSP) are two unsupervised sub-tasks for BERT pretraining.
These tasks use context in both the forward and backward directions, which is why the model is called
bidirectional. The MLM sub-task masks some of the input tokens at random and tries to predict the masked
tokens. NSP is responsible for predicting whether the sequence after the [SEP] token follows the preceding
one. Prediction through these techniques thus yields context-related information for different tokens.
Tokenization in the BERT architecture is done with WordPiece tokenization, with a vocabulary of 30,522
tokens. For class prediction, BERT creates a vector of length 768 that summarizes the tokens for the
classes of the application in which it is applied.
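
To make the role of the special tokens concrete, the sketch below assembles a BERT-style input from tokens that are already split. It is illustrative only: the function name `build_inputs` and the `max_len` default are assumptions for this example, and real WordPiece tokenization and the mapping of tokens to vocabulary IDs are not reproduced.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=12):
    """Assemble a BERT-style token sequence with segment IDs and an attention mask."""
    # [CLS] starts the sequence; [SEP] closes each sentence.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate the inputs first")
    # Real tokens get mask 1, [PAD] positions get mask 0.
    pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad
    tokens = tokens + ["[PAD]"] * pad
    segment_ids = segment_ids + [0] * pad
    return tokens, segment_ids, attention_mask
```

For example, `build_inputs(["great", "hotel"], max_len=6)` pads the four real tokens with two `[PAD]` entries. In the setup described in section 3.1, the attention mask is the part that was kept for BERT and dropped for FNet.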


3.2.2. Discrete fourier transform 
Using the Fourier transform, a function is decomposed into its constituent frequencies. Given an
input sequence x_0, x_1, ..., x_{N-1}, where n ranges from 0 to N-1, the DFT can be defined by (1) [8]:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} n k}, \quad 0 \le k \le N-1 \qquad (1)$$

Here, X_k is the new representation, formed as a weighted sum of all input tokens. The DFT can be
computed by matrix multiplication, and convolution in the time domain corresponds to multiplication in
the frequency domain. This can decrease computational costs and accelerate the convolutional neural
network training process. This idea encouraged researchers to modify the transformer architecture by
using the Fourier transform, as in the FNet architecture.
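
As a worked illustration of the DFT definition in (1), the following sketch evaluates the sum directly. It is a naive O(N^2) implementation written for clarity; practical systems would use a fast Fourier transform, and the function name `dft` is chosen for this example.

```python
import cmath


def dft(x):
    """Naive DFT: X_k = sum over n of x_n * exp(-2*pi*i*n*k / N)."""
    n_points = len(x)
    return [
        sum(
            x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points)
        )
        for k in range(n_points)
    ]
```

A constant sequence such as `[1, 1, 1, 1]` puts all of its energy into the zero-frequency term, while a single impulse `[1, 0, 0, 0]` spreads equally over every frequency, which shows how each output X_k depends on all inputs.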


3.2.3. FNet architecture 

FNet is a modification of the transformer-based BERT architecture (Figure 2(a)): it is an
attention-free transformer. Figure 2(b) depicts the FNet architecture with the required sublayers. The
architecture is built such that the self-attention sublayer of each transformer encoder is replaced with
a Fourier sublayer. In other words, FNet applies a 2D discrete Fourier transform to its embedding input,
one along the sequence dimension (F_seq) and one along the hidden dimension (F_h). If the embedding input
is x, the Fourier transform is F, and the real part is \Re, then the Fourier operation can be defined by (2):

$$y = \Re\left(F_{\text{seq}}\left(F_{h}(x)\right)\right) \qquad (2)$$

Figure 2. N encoder blocks of: (a) BERT architecture and (b) FNet architecture


The Fourier transform ensures that the feed-forward sublayer has sufficient access to all tokens. The
Fourier sublayers repeatedly transform the input back and forth between the time and frequency domains.
Multiplication in the frequency domain is equivalent to convolution in the time domain, so FNet can be
viewed as alternating between multiplications and convolutions.
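
The token-mixing operation described in this subsection, a DFT along the hidden dimension followed by a DFT along the sequence dimension with only the real part kept, can be sketched on a small matrix. This is a plain-Python illustration under the assumption that the input is a list of rows (one row per token); the function names are chosen for this example, and it omits the residual connections, normalization, and feed-forward sublayers of a full encoder block.

```python
import cmath


def dft(x):
    """Naive 1D DFT of a sequence of real or complex numbers."""
    n_points = len(x)
    return [
        sum(
            x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


def fourier_mixing(x):
    """Real part of a 2D DFT over a (sequence x hidden) matrix."""
    # DFT along the hidden dimension: each row is one token's embedding.
    hidden_mixed = [dft(row) for row in x]
    # DFT along the sequence dimension: each column spans all tokens.
    columns = [dft(list(col)) for col in zip(*hidden_mixed)]
    # Transpose back to (sequence x hidden) and keep only the real part.
    return [[value.real for value in row] for row in zip(*columns)]
```

Because the operation has no learnable weights, every output entry is a fixed combination of all input entries, which is how the sublayer mixes information across tokens without attention parameters.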

4. RESULTS AND DISCUSSION

In this section, we will describe the dataset used in our project, the environmental and experimental 
setup to run the investigation, the performance evaluation metrics, observed results, and the discussion of the 
findings of our experiment with multiple subsections. 


4.1. Dataset description 

Our dataset, named 515K hotel reviews in Europe, was obtained from the Kaggle website. The data was
collected from the Booking.com website and is publicly available. The dataset consists of a total of
515,738 customer reviews with 17 attributes. Our target attributes were positive review, negative
review, and reviewer score. The reviewer score ranges from 0 to 10. In our dataset, the lowest reviewer
score was 2.5, and the highest was 10. Table 1 presents examples of positive and negative reviews. After
preprocessing and labeling the data, we split the dataset into training, validation, and test sets. We
have also presented a statistical summary of the dataset in Table 2.


Table 1. Example of review text

Review type       Review text
Positive review   No real complaints the hotel was great great location surroundings rooms amenities and service...
Negative review   Rooms are nice but for elderly a bit difficult as most rooms are two story with narrow steps...


Table 2. Statistical analysis of dataset details

Label      No. of instances   Train (60%)   Validation (25%)   Test (15%)
Positive   485,035            291,072       121,220            72,743
Negative   30,703             18,370        7,715              4,618
Total      515,738            309,442       128,935            77,361


4.2. Experiment setup 

We performed all our investigations on Google Colab with a built-in Tesla K80 GPU that has 12 GB of
memory. For our experiment on this large corpus with unbalanced classes, we set the parameters as
follows: maximum length of a sentence: 100, batch size: 128, optimizer: Adam, learning rate: 2e-5,
regularizer: L2 (1e-5).

For the ablation study on the performance of FNet, the additional learning rates of 3e-5 and 4e-5
were chosen. The setup was the same for BERT and LSTM to make a fair comparison. All the models were
trained on the dataset for 4 epochs. However, as SVM was fitted on TF-IDF features, we used standard
parameters for the SVM model.
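
For reference, the hyperparameters listed above can be collected in a small configuration object. This is only an organizational sketch: the class name `FineTuneConfig` and its field names are hypothetical and do not correspond to any particular training library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in section 4.2 (field names are illustrative)."""
    max_seq_length: int = 100
    batch_size: int = 128
    optimizer: str = "adam"
    learning_rate: float = 2e-5
    l2_weight: float = 1e-5
    epochs: int = 4


# Base setting shared by FNet, BERT, and LSTM.
BASE = FineTuneConfig()

# Learning rates compared in the FNet ablation study.
ABLATION = [FineTuneConfig(learning_rate=lr) for lr in (2e-5, 3e-5, 4e-5)]
```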

As FNet is a modified architecture of BERT, we followed the approach commonly shared for classifying
texts with BERT and adapted it to use the FNet weights published on the Hugging Face website. We removed
the attention masks ourselves while preparing the input representation for FNet fine-tuning. For the
other baseline models (LSTM and SVM), we followed the general approaches publicly available in the
research community.


4.3. Performance evaluation 

As FNet closely resembles the BERT architecture, differing in the configuration of the
self-attention mechanism [8], we selected performance metrics following the original FNet paper. We
therefore considered training time and memory to compare FNet with BERT. We also measured accuracy,
precision, recall, and F1-score for FNet and for the BERT, LSTM, and SVM models selected in this work.
These baseline models were selected because of their extensive usage in the field of NLP [30].
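
The classification metrics named above can be computed from confusion-matrix counts as in the sketch below. It shows the standard binary definitions only; the text does not state which averaging scheme (for example, macro or weighted) was used for the reported values, so this should be read as a reminder of the formulas rather than as the study's evaluation code.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Since F1 is the harmonic mean of precision and recall, it always lies between the two values for a given class.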


4.4. Results 

In this section, we present the results obtained for the various performance metrics. To validate
the FNet researchers’ assertion, we compared the training time and memory utilization of FNet and BERT
in Table 3. The FNet model took around 160 minutes for overall training, validation, and test execution
on our dataset, whereas BERT took 199 minutes. Therefore, our model saved 19.6% of the time. Moreover,
its memory consumption was 60.4% lower than that of the BERT (base) model. Table 4 presents the
performance comparison between our selected models.
Table 3. Time and memory comparison between FNet and BERT

Model name   Training time (minutes)   Memory utilization (GB)
BERT         119.4                     11.62
FNet         96                        4.60
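
The percentage savings quoted in the text follow directly from the figures in Table 3, as the short calculation below shows. The helper name `relative_reduction` is chosen for this example.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Training time in minutes and memory in GB, taken from Table 3.
time_saved = relative_reduction(119.4, 96.0)
memory_saved = relative_reduction(11.62, 4.60)

# Overall run time in minutes (training, validation, and test) from the text.
overall_time_saved = relative_reduction(199.0, 160.0)

print(f"time: {time_saved:.1f}%  memory: {memory_saved:.1f}%  "
      f"overall time: {overall_time_saved:.1f}%")
```

Both the training-only times in Table 3 and the overall run times given in the text yield a saving of about 19.6%, and the memory figures yield about 60.4%.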


Table 4. Comparison of classification results in different models

Model name   Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
FNet         80.27          79.31           80.47        80.93
BERT         82.03          82.29           82.43        81.37
LSTM         93.96          77.13           49.68        48.16
SVM          90.19          49.36           49.10        49.17


BERT is widely regarded as the first choice for state-of-the-art applications in NLP [3], and here it
achieved the best precision, recall, and F1-score values. Although LSTM and SVM with TF-IDF had very good
test accuracy, their performance was poor on the other metrics. We also observed that FNet performed as
stated in the original FNet paper, with results close to BERT’s and a test accuracy of 80.27%.

We also conducted an ablation study for our FNet model with different learning rates; the results
are given in Table 5. A learning rate of 4e-5 was the best choice in our experimental setup, as it gave
better accuracy, precision, recall, and F1-score than the other two learning rates. Figures 3 and 4
present the accuracy and loss values, respectively, of our FNet model and BERT over four epochs.


Table 5. Ablation study for FNet with different learning rates

Learning rate   Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
2e-5            80.27          79.31           80.47        80.93
3e-5            81.28          81.21           81.15        80.97
4e-5            81.96          82.32           81.86        82.17
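
The ablation results in Table 5 can be summarized programmatically. The sketch below picks the learning rate with the highest test accuracy and expresses FNet accuracy as a share of BERT’s accuracy from Table 4; the variable names are chosen for this example.

```python
# FNet test accuracy (%) per learning rate, from Table 5.
FNET_ACCURACY = {"2e-5": 80.27, "3e-5": 81.28, "4e-5": 81.96}

# BERT test accuracy (%) at learning rate 2e-5, from Table 4.
BERT_ACCURACY = 82.03

best_lr = max(FNET_ACCURACY, key=FNET_ACCURACY.get)

# FNet accuracy relative to BERT at the shared 2e-5 setting.
share_of_bert = FNET_ACCURACY["2e-5"] / BERT_ACCURACY * 100.0

print(f"best learning rate: {best_lr}")
print(f"FNet reaches {share_of_bert:.2f}% of BERT's accuracy at 2e-5")
```

This reproduces the 97.85% figure quoted for the identical-parameter comparison.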


4.5. Discussions 

Transformer models offer context-based tokenization and feature selection, which are absent in
traditional NLP algorithms [31]. Although BERT is still the best option for NLP tasks so far, its
computation is expensive, so we applied the recent FNet model to the hotel review dataset for sentiment
analysis. As we removed the attention masks from the input texts when training FNet, the model uses the
Fourier transform to linearly transform the text features and hence works as an unparameterized form of
attention. Because of this, there are fewer parameters to train, as reflected in our results in Table 3.
In addition, tokenization took less time for FNet than for BERT, and approximately 30% less of the
overall memory was used for FNet tokenization.

When comparing the performance of FNet with the other models in Table 4, we also found that LSTM
and SVM had poor precision, recall, and F1-score values. Possible reasons are the uneven distribution of
the classes in the dataset, the absence of cross-validation, and the use of parameters similar to those
of FNet to keep the comparison unbiased. Nonetheless, both the BERT and FNet architectures generalized
well, as can be seen from the performance metrics in Table 4 and from Figures 3(a)-(b) and 4(a)-(b). This
is because transformer models can distribute the weights of different classes to the optimizer without
becoming biased toward any specific class [32]. Furthermore, we could only run the training and model
selection for 4 epochs. As the curves show generalized patterns, we can say that our model was
well-fitted and could be improved with more epochs. Another reason for using 4 epochs is that the
pre-trained BERT was trained for 4 epochs, and these models converge readily within that many epochs.
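
One way to see why accuracy alone can be misleading on a dataset this unbalanced is to compute what a trivial classifier that always predicts the majority class would score on the test split in Table 2. The sketch below does this; it assumes macro-averaged recall for illustration, since the averaging scheme behind the reported metrics is not stated.

```python
def majority_baseline(class_counts: dict) -> tuple:
    """Accuracy (%) and macro-averaged recall (%) of always predicting the majority class."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    accuracy = class_counts[majority] / total * 100.0
    # Recall is 1 for the majority class and 0 for every other class.
    macro_recall = 100.0 / len(class_counts)
    return majority, accuracy, macro_recall


# Test split sizes from Table 2.
label, accuracy, macro_recall = majority_baseline({"positive": 72_743, "negative": 4_618})
print(f"{label}: accuracy {accuracy:.2f}%, macro recall {macro_recall:.1f}%")
```

Such a baseline reaches about 94% accuracy with 50% macro recall, which is why precision, recall, and F1-score are reported alongside accuracy.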

The ablation study shows that a learning rate of 4e-5 provided a better result in this classification
task. However, the best parameters could be found through further hyperparameter tuning. Overall, the
experiment shows that reducing training time and memory utilization can be helpful when training on a
large corpus, and that models such as FNet can achieve performance similar to that of more sophisticated
models even without using attention masks.

Among the transformer-based models that we studied in this research, fine-tuning FNet is more
cost-effective than fine-tuning BERT. Because of the unparameterized transformation through Fourier
analysis, FNet omits the multi-head attention mechanism, which in the BERT architecture operates on all
the tokens generated from the word embedding. Therefore, it reduces time and memory consumption
simultaneously, and, because of the normalization layers, it can still map the nonlinear transformations
of token weights accurately. The results also show that, with a higher learning rate, FNet performs
similarly to BERT. As resource optimization is vital when working with large amounts of data, FNet can
be useful in NLP tasks such as question answering, named-entity recognition, and relation extraction.
Therefore, a further goal of this research is to extend the application of FNet in the text-mining
community. In the future, research can also be done on tuning different layers inside the FNet
architecture, for example, changing the number of DFT operations for each input representation or
changing the number of hidden neurons inside the feed-forward and dense layers, to observe the
performance variation in the text mining area. This will help us understand the impact of using the
Fourier transform in analyzing text features and applications.


[Figure 3 plots omitted. Panel (a): BERT -> Train vs Validation vs Test Accuracy. Panel (b): FNet ->
Train vs Validation vs Test Accuracy. Curves: train, validation, test.]

Figure 3. Accuracy curves for (a) BERT and (b) FNet models over four epochs


[Figure 4 plots omitted. Panel (a): BERT -> Train vs Validation vs Test Loss. Panel (b): FNet -> Train
vs Validation vs Test Loss. Curves: train, validation, test.]

Figure 4. Loss curves for (a) BERT and (b) FNet models over four epochs


5. CONCLUSION 

In this research, we implemented the FNet model to classify hotel reviews and extract customer
sentiments. Our main focus was to compare the FNet model with the BERT model, since FNet is a
modification of BERT. We found that FNet significantly reduced memory usage (by about 60%) and execution
time (by about 20%) compared to BERT. With the same parameter setting as BERT, FNet reached a test
accuracy of 80.27%, which is about 97.85% of BERT’s accuracy. We also compared FNet with the classic SVM
and LSTM models and found that FNet achieved higher precision, recall, and F1-score than both. In
addition, we presented an ablation study demonstrating the performance of FNet at different learning
rates; a learning rate of 4e-5 gave the best result, with a test accuracy of 81.96%. In future studies,
we wish to extend this research to more diverse datasets and to other transformer-based models such as
RoBERTa, ALBERT, and XLNet.